This notebook aims to assess our ability to call genomic variants accurately by comparing replicate sequencing runs.
The samples consist of 24 sequencing runs with a Ct value of less than 20. Each sample was originally sequenced once as part of this study; the runs with a low enough Ct value to be considered for re-sequencing by a shotgun metagenomic method were then re-sequenced as replicates. One sample (SpID: 10110) is currently excluded from this analysis because it is not on the SRA, leaving 23 final samples.
By comparing the allele frequency of variants called in both replicates, we hope to assess the accuracy, the optimal frequency cutoff, and the optimal coverage cutoff for downstream bottleneck analyses.
The critical difference between the replicates is the processing. The samples on the SRA underwent a slightly different pre-processing pipeline to remove human reads and trim adaptors. Eventually, I’ll re-process the raw files to make the pipelines identical.
For this preliminary analysis, I identified variants by two methods. First, I aligned the samples with BWA and called variants with Varscan (with no filters for minimum allele frequency, depth, or supporting reads). Second, I tabulated each base at every position in the genome, along with its location on the forward or reverse strand, using Samtools mpileup.
I think that one of the limiting factors in this analysis is the lack of coverage and the disparity in coverage between replicates. I’m not sure if this difference is due to pre-processing or random variation from re-sequencing.
Below, I’m comparing the difference in the number (and %) of mapped reads between replicates. There are some substantial differences between replicates.
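The replicate comparison can be sketched as follows. This is a minimal illustration using two of the accessions and mapping statistics quoted in the tables later in this notebook; in the real analysis these values would come from the alignment statistics (e.g., samtools flagstat), and the record layout here is an assumption.

```python
# Sketch: per-sample difference in mapped reads and percent mapped
# between replicates. Values copied from the tables in this notebook;
# the record layout is illustrative.
mapping_stats = [
    # (accession, replicate, reads_mapped, percent_mapped)
    ("SRR11939958", 1, 454944, 61.76),
    ("SRR11939958", 2, 155589, 20.82),
    ("SRR11939986", 1, 85900, 1.68),
    ("SRR11939986", 2, 1021770, 54.89),
]

# Group the two replicates of each accession together.
by_sample = {}
for acc, rep, reads, pct in mapping_stats:
    by_sample.setdefault(acc, {})[rep] = (reads, pct)

# Report the replicate-2 minus replicate-1 differences.
for acc, reps in sorted(by_sample.items()):
    (r1, p1), (r2, p2) = reps[1], reps[2]
    print(f"{acc}: delta reads = {r2 - r1:+d}, delta % mapped = {p2 - p1:+.2f}")
```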
There is also a pronounced disparity in the patterns of coverage over the genome between replicates. Below, I’ve plotted the depth of coverage in 50 bp bins over the genome. I filtered out the last 53 bp (2 bins) of the genome because the poly-A tract there attracts a tremendous number of aligned reads, which skews the visualization. I will figure out a way to handle these alignments.
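The binning step can be sketched as below. This assumes a 50 bp bin width (which matches "last 53 bp = 2 bins") and uses simulated per-base depths in place of real output from a depth tool such as samtools depth.

```python
# Sketch: mean depth of coverage in 50 bp bins, excluding the poly-A tail.
BIN = 50            # bin width in bp (assumption consistent with the text)
GENOME_LEN = 29903  # SARS-CoV-2 reference genome length
TAIL = 53           # trailing poly-A region excluded from the plot

# Placeholder per-base depths; the real values would come from
# `samtools depth -a` or similar.
depths = [10] * GENOME_LEN

# Drop the poly-A tail, then average depth within each 50 bp bin.
usable = depths[: GENOME_LEN - TAIL]
bins = [
    sum(usable[i : i + BIN]) / len(usable[i : i + BIN])
    for i in range(0, len(usable), BIN)
]
print(len(bins), "bins")
```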
Also, interestingly, there is a clear pattern of coverage over the genome in the second replicate (blue) that doesn’t seem so evident in the first. I’m not sure what to make of this yet. Perhaps this pattern is a consequence of using a different pre-processing pipeline than the SRA replicates.
Next, I wanted to identify and remove all consensus variants shared between the samples. Optimally, I could do this in the analysis pipeline upstream of the final alignment. However, by identifying consensus variants at this stage, I can check for any systematic biases or samples that might skew the cohort’s consensus sequence by having a replicate with too little depth over a region.
To identify consensus variants, I filtered for any variants that appear at greater than or equal to 50% frequency in either replicate and counted the number of samples with these variants.
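This filtering logic can be sketched as follows. The sample names and call records are illustrative, not real data from this study.

```python
# Sketch: flag consensus variants (>= 50% frequency in either replicate)
# and count how many samples carry each one.
from collections import defaultdict

calls = [
    # (sample, replicate, variant, frequency) -- illustrative records
    ("S1", 1, "C241T", 0.99), ("S1", 2, "C241T", 0.98),
    ("S1", 1, "A1000G", 0.03),
    ("S2", 1, "C241T", 0.97), ("S2", 2, "C241T", 0.51),
]

# A variant counts as consensus for a sample if it reaches >= 50%
# frequency in at least one replicate of that sample.
consensus_samples = defaultdict(set)
for sample, rep, variant, freq in calls:
    if freq >= 0.5:
        consensus_samples[variant].add(sample)

counts = {v: len(s) for v, s in consensus_samples.items()}
print(counts)  # number of samples carrying each consensus-level variant
```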
There are two SNPs (indicated in red) that show up in all 23 samples. Another ~7 SNPs are likely consensus variants but are missing from one or two samples (or replicates).
I was curious if these SNPs present in 21 or 22 samples were genuinely absent from the remaining samples. I checked if there was a bias in what samples were missing consensus variants.
## [1] "C7564T-SRR11939986"
## [1] "G11083T-"
## [1] "G28376T-SRR11939986"
Two samples (SRR11939959 and SRR11939986) are missing three and four of these variants, respectively.
| Accession | Replicate | Reads Mapped | Percent Mapped | Average Ct |
|---|---|---|---|---|
| SRR11939958 | 2 | 155589 | 20.817534 | 19.885 |
| SRR11939958 | 1 | 454944 | 61.758921 | 19.885 |
| SRR11939959 | 2 | 150975 | 15.068719 | 19.900 |
| SRR11939959 | 1 | 130396 | 3.861836 | 19.900 |
| SRR11939986 | 2 | 1021770 | 54.889163 | 19.025 |
| SRR11939986 | 1 | 85900 | 1.679103 | 19.025 |
Samples SRR11939959 and SRR11939986 have at least one replicate with a very low percentage of reads mapped. The following SNPs were excluded because they were missing from a single replicate:
- A23403G
- C14408T
- C241T
- G25563T
- G11083T

These SNPs do appear to be true consensus SNPs. Two SNPs are genuinely missing from both replicates of SRR11939986 and can’t be easily explained by low coverage:
- C7564T
- G28376T

Therefore, the following 7 SNPs are the true consensus SNPs and should be masked from downstream analyses.
## [1] "A23403G" "C1059T" "C14408T" "C241T" "C3037T" "G25563T" "G11083T"
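The masking step can be sketched as below, using the seven SNP IDs listed above. The variant record layout (`id`, `freq`) is an assumption for illustration.

```python
# Sketch: mask the seven shared consensus SNPs before any
# within-host comparison. SNP IDs are copied from the list above.
CONSENSUS_SNPS = {
    "A23403G", "C1059T", "C14408T", "C241T",
    "C3037T", "G25563T", "G11083T",
}

def mask_consensus(variants):
    """Drop variants whose ID is in the cohort consensus set."""
    return [v for v in variants if v["id"] not in CONSENSUS_SNPS]

# Illustrative variant records for one sample.
sample_variants = [{"id": "C241T", "freq": 0.99}, {"id": "A100T", "freq": 0.02}]
print(mask_consensus(sample_variants))
```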
Before I compare the SNPs called in the replicates, I wanted to see how many SNPs were called in each sample.
There are three main takeaways from the above plots:
Next, I wanted to assess the variability in the frequency of alternative alleles called in each replicate. Samples with significant variability likely have low viral template number and can be excluded from downstream analyses. Additionally, I should be able to determine which samples need to be sequenced to higher depth.
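One possible variability statistic, sketched here as an assumption rather than the notebook's actual metric, is the root mean squared difference in allele frequency between replicates, computed over variants called in both:

```python
# Sketch: per-sample replicate variability as the RMS difference in
# allele frequency across variants shared by both replicates.
# This particular statistic is an assumption for illustration.
import math

def rms_freq_diff(pairs):
    """pairs: list of (freq_rep1, freq_rep2) tuples for shared variants."""
    if not pairs:
        return float("nan")
    return math.sqrt(sum((a - b) ** 2 for a, b in pairs) / len(pairs))

# Illustrative shared-variant frequencies for one sample.
shared = [(0.05, 0.06), (0.10, 0.02), (0.95, 0.97)]
print(round(rms_freq_diff(shared), 4))
```

Samples with a high value of this statistic would be candidates for exclusion or deeper sequencing.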
First, I plotted the simplest case. Here are all of the SNPs called by Varscan, excluding the seven consensus variants shared by all samples.
Several samples have nearly no variants shared between replicates at frequencies below ~1. These include the following samples:
- SRR11939954
- SRR11939959
- SRR11939997
- SRR11939998

I labeled these samples as the “Worst” quality replicates. Below is a table summarizing their depth and Ct.
| Accession | Replicate | Reads Mapped | Percent Mapped | Average Ct |
|---|---|---|---|---|
| SRR11939954 | 2 | 71229 | 11.105654 | 17.760 |
| SRR11939954 | 1 | 688885 | 96.983302 | 17.760 |
| SRR11939959 | 2 | 150975 | 15.068719 | 19.900 |
| SRR11939959 | 1 | 130396 | 3.861836 | 19.900 |
| SRR11939997 | 2 | 502378 | 26.931036 | 19.845 |
| SRR11939997 | 1 | 418272 | 33.464276 | 19.845 |
| SRR11939998 | 2 | 1861491 | 38.530957 | 19.865 |
| SRR11939998 | 1 | 213574 | 8.680307 | 19.865 |
All four samples have relatively few reads mapped, and SRR11939959, SRR11939997, and SRR11939998 also have comparatively high Ct values (~19 cycles).
Otherwise, the main takeaway from the plot above is that there are few variants at intermediate frequencies: most are either low frequency (&lt; 5%) or high frequency (&gt; 90%). Also, as expected, high-frequency variants have good concordance between replicates. Therefore, the relationship is easier to visualize with a log transformation.
Some samples have better concordance between allele frequencies than others. However, the relationship is easier to visualize with insertions and deletions included.
With this, I can break the samples into either “Good” or “Bad” quality groups. The “Bad” quality group excludes the “Worst” quality samples mentioned above.
Finally, I wanted to see if these relationships hold when I include the variants I identified with mpileup. Below, these variants are layered over the variants called by Varscan. The line showing the relationship between allele frequencies is drawn for the Varscan SNPs and is colored by the sample-quality mentioned above.
“Bad” and “Worst” quality replicates do not have concordant allele frequencies. However, the pattern isn’t clear for all samples, so I log-transformed the axes to get a better sense.
There is a significant amount of variability regardless of the concordance between allele frequencies called by Varscan. The alleles annotated in the pileup statistics are possibly too lenient since I haven’t applied any filters. Looking at the data, I noticed that variants shared between mpileup and Varscan have at least seven supporting reads in each replicate. This number isn’t hardcoded into the variant calling by Varscan, but is probably a consequence of the maximum coverage. From looking at the Varscan manual, that’s the only explanation I have.
I applied the filter that any allele must be supported by at least 7 reads in either replicate. The remaining alleles show pretty good concordance, even the variants only called from the pileup file.
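The read-support filter can be sketched as below; the allele record layout is illustrative.

```python
# Sketch: keep an allele only if it has at least MIN_READS supporting
# reads in either replicate (the filter described above).
MIN_READS = 7

alleles = [
    # Illustrative allele records with per-replicate supporting reads.
    {"id": "A100T", "reads_rep1": 12, "reads_rep2": 3},
    {"id": "G200C", "reads_rep1": 2,  "reads_rep2": 1},
]

kept = [
    a for a in alleles
    if a["reads_rep1"] >= MIN_READS or a["reads_rep2"] >= MIN_READS
]
print([a["id"] for a in kept])
```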
Samples with very few variants fall both into the “Worst” and “Bad” categories. Since these samples have so few variants, they can’t be assessed based on the variability in allele frequency. The only way to remedy this is to sequence more. However, since most of the “Worst” samples have relatively high Ct values, this might be futile. These samples are as follows:
Next are the samples with comparatively many shared variants. Of these, four have a similar number of variants to the “Good” samples, but more variability. These are the samples I assume have low template diversity. These samples are:
Finally, there are the “Good” samples that probably don’t need much more re-sequencing:
The worst samples have low depth and high Ct. The best samples have high depth and low Ct. With a better statistic for variability than manually binning the runs into categories, it should be easier to correlate the depth and Ct with the replicates’ accuracy. Effective depth should accomplish this.